Abstract

Document imaging technology has developed to the point where it is not uncommon for
organizations to scan large numbers of documents into databases with little or no index
information. This may be done for archival purposes, in which case the necessary index
may be as simple as a case number, or with the ultimate goal of automatically extracting
index information for content-based queries. Maintaining the integrity of such a database
is difficult, especially in a distributed environment where copies of documents with
different physical histories may be scanned at different times. In this paper we present
a novel approach to detecting duplicate documents in very large databases using only
features extracted from the image. The method is based on a robust “signature” extracted
from each document image which is used to index into a table of previously processed
documents. The system is able to deal robustly with differences between scanned documents
with respect to such factors as resolution, skew and image quality. The approach has a
number of advantages over OCR or other recognition-based methods including speed and
robustness to imaging distortions. To justify the approach and demonstrate its
scalability, we have developed a simulator which allows us to change parameters of the
system and examine performance while processing millions of document signatures. A
complete system has been implemented and tested on a collection of technical articles
and memos.
1 Introduction


Steady increases in computational power and affordable storage have allowed very large
heterogeneous databases to be considered as a viable means of archiving or storing
document images. It is not uncommon to see document collections in excess of one million
images, and in today’s distributed environments, the management, storage, and retrieval
of these collections have become important issues. One primary consideration is the
generation of the index information used to identify the database object. Traditional
database indexes may contain, for example, administrative data, document ID numbers,
and possibly a small number of keywords which are provided or can be extracted directly
from the data. Image databases, however, typically require the manual entry of index
information since adequately expressive indexes are not always available. When dealing
with millions of documents, manual indexing is typically not cost-effective.

A second concern is that in an image-based system, traditional organization, search
and retrieval techniques are not ideal, in part because of the sheer volume of information
that images contain. Although documents are a written representation of a language,
a document in image form lacks the type of content information typically available with
text documents. If accurate and unique index information is available for images, many of
the operations which are handled by traditional database systems are trivial. In cases
where index information is not available, indexing and retrieval remain difficult and
challenging research problems. In this paper we explore one such problem: detecting
duplicate document images in the absence of appropriate index information.

Consider a situation where thousands of documents are being imaged and added to
a single heterogeneous database, possibly from a distributed environment. If multiple
instances of the same document exist, they may be re-processed or re-entered
unnecessarily. This redundancy in the database may be undesirable for a number of reasons,
including increased storage cost, difficulties in maintaining database integrity, increased
processing cost for database operations, and the cost of indexing multiple images with the
same underlying content.

It is therefore desirable to have a preprocessing mechanism for detecting duplicate
instances of a document image prior to either indexing it or adding it to a database.
Naturally, the definition of a “duplicate instance” is open to some interpretation, and
may be defined differently for different applications. The level of similarity between
documents is a key factor in identifying what constitutes a duplicate. We will define
three levels of similarity.

A first level of similarity contains identical documents. These are documents which
are, for all practical purposes, the same even at the image level. These documents
most often result from a scan of a single original manuscript, with multiple copies of the
document’s image file being distributed electronically. We can assume that the images
vary only in the file format or file representation and should be easily identified by
performing either a byte-by-byte file comparison or a pixel-by-pixel image comparison.

A second level of similarity contains image-variant documents, which are images that
arise from different instances of the same original document. The originals may have been
scanned at different times, and may have independently undergone various types and levels
of physical degradation. This is often observed, for example, with published documents
that were originally distributed to multiple sites (e.g., technical reports and copies of
memos). Image-variant duplicates are identical with respect to content and structure, but
may differ substantially at the pixel level and cannot be easily identified from
pixel-by-pixel comparisons.

A third level of similarity contains documents which vary in structure, but contain
essentially the same content. This is common, for example, when a document is originally
scanned and entered, and later, a revised or reformatted version of the same document is
added to the database. We will not address the problem of structure-variant documents
since the extent of variation is open-ended.

We are currently addressing the problem of detecting image-variant duplicates, where
multiple instances of an effectively identical original source are scanned for incorporation
into a database. The original documents may have been written on, stapled, torn, taped,
or may have pages missing or a cover added. The document may have been copied
repeatedly, so different generations of copies may be involved. The document may have
been scanned at different times and on different devices, so resolution, illumination, and
contrast may also be issues. Skew and translation may result in additional distortion.
The goal of our project is to analyze documents which are candidates for being added to
a database, so that when variations of an existing document are presented, the system
is able to identify the duplicates and not process them further. The ability to process
documents without prior indexing is essential if the practical use of large-scale document
image databases is to be successful.

In Section 2 of this paper, we provide a brief overview of the problem of duplicate
document image detection and our proposed approach. In Section 3, we discuss the
system’s feasibility and present results obtained using a simulator which allows us to test
indexing mechanisms that involve millions of indices. In Section 4 we provide details of
our implementation and interface. Finally, we discuss experimental results on a database
of technical articles and memos in Section 5 and provide a brief discussion in Section 6.

2  Duplicate  Document  Identification 
2.1  Problem  Overview 

Let us assume that document images are scanned with the intention of adding the images
directly to a database. Depending on what information is available a priori in the system,
the problem of duplicate detection can be approached in a number of ways. If, for
example, basic index information such as the document number, date, title, authors or
number of pages is entered manually, this information could serve as a preliminary filter
for duplicates. In most cases, however, high-volume operations prohibit such manual
entry prior to scanning. Instead, we would like to identify duplicates directly from their
images. Although we have basic quantitative information such as the number of pages,
at this point we consider only the analysis of the image itself.

One possible solution which has been proposed is to apply Optical Character
Recognition (OCR) to the document image and match as much text as possible between the
documents. Although this matching can be done relatively quickly, OCR performance
suffers on degraded documents in terms of both accuracy and speed. For this reason, we
do not feel that OCR is feasible as a first-level filter, but OCR may be used as a secondary
filter, to reduce the number of possible matches from hundreds to tens of documents.
Any approach which we choose should have a number of properties for it to be
considered as a feasible solution in this domain. First, we are constrained by the fact that
we may be dealing with millions of documents, many of which may be highly degraded.
To cope with such situations, we must have a signature for each document which is

Robust - The signature should be reliably extractable, even when the document
becomes degraded.

Unique - Although we cannot realistically expect the signatures to be unique unless
we use an excessively large feature set, a given signature should be associated with
several tens of documents at most.

Compact - The storage capacity required to hold the signatures of millions of
documents may be very large, so the index keys should be as small as possible.

In addition, the algorithms which extract the signature must be

Fast - Algorithms which take minutes to extract a signature, and then attempt to
match it against each document in the database, are not acceptable. The application
demands rapid extraction, and near constant time indexing into the database
of previously entered documents.

Scalable - Initially the algorithms will work on hundreds of documents, but as more
documents are processed, and non-duplicates added, the size of the database could
grow to tens of millions.

Accurate - It is acceptable to miss a small percentage of duplicates since the result
is simply that the same document is entered twice, but identifying documents as
duplicates when they are not (false alarms) is not acceptable.

The overall goal of a duplicate detection system should be to either 1) determine
that the document is not a duplicate or 2) accurately identify 10-20 documents of which
it could be a duplicate. In the latter case, other methods of analysis such as OCR,
structural analysis, or human verification can be used as post-processors to eliminate
non-duplicates and confirm duplicates.

Our preliminary experiments have allowed us to draw some general conclusions about
the types of algorithms that will and will not work. First, algorithms which attempt any
in-depth analysis of the document (such as OCR or structural analysis) are not ideal
because they cannot extract the features (recognized characters, in the case of OCR)
robustly enough for degraded documents, and because the resulting index information
would be too extensive. Second, any matching scheme which requires comparison of
the extracted features to a significant portion of the database will fail because of the
computational requirements. An efficient indexing scheme is required.

We have developed an approach which promises to fulfill most of our criteria. The
approach is based on a robust signature consisting of shape codes extracted from the
textual components of the document. In the next section, we describe our approach,
some experiments we have run to show its effectiveness, and research issues which must
be addressed in order to develop the working system.

2.2  Basic  Approach 

Our approach is based on the extraction of a signature from a representative line of text
in the document image using a shape coding technique. The technique has been used
by a number of authors including Tanaka [8] and Spitz [7] for other document analysis
applications. Shape coding labels the symbols in the line of text based on very simple
shape properties, such as whether the symbols are ascenders, descenders, limited to the
x-line, multi-component, or punctuation. These properties are much more robust to noise
than the features necessary for OCR, and can be extracted fairly rapidly.

To extract the signature, the document is scanned for a representative sample of text,
typically on the order of 50 symbols, on a single line or across several lines. From this
sample, the base-line, x-line, ascender-line and descender-line are identified, and each
character component is assigned a shape code as shown in Figure 1.

Figure 1: Sample character shape code assignment

Figure 2: Overview of the indexing scheme.

The string of shape codes assigned to the characters in the text sample is used as
a signature for the document. A level of robustness is added by indexing based on
n-grams of the shape code string, rather than attempting to use an index based on the
entire string. Each shape code n-gram serves as an index key into the database. A
single dropped or inserted code may affect at most n of these keys but will not affect the
entire signature. Figure 2 shows the relationship between the signature, its keys, and the
database. When a set of keys is presented for indexing, each key results in a collection of
hits from the database. Each hit is a vote for a document, and a ranked list of documents
can be returned.
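
The n-gram keying and voting just described can be illustrated with a short Python
sketch. This is only an illustration under stated assumptions, not the system's
implementation: the names `ngrams` and `SignatureIndex`, the in-memory dictionary, and
the single-character shape codes are all hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(signature, n=5):
    """Return every contiguous n-gram (index key) of a shape code string."""
    return [signature[i:i + n] for i in range(len(signature) - n + 1)]


class SignatureIndex:
    """Hypothetical in-memory index mapping each n-gram key to document ids."""

    def __init__(self, n=5):
        self.n = n
        # key -> list of document ids, one entry per occurrence of the key
        self.table = defaultdict(list)

    def add(self, doc_id, signature):
        for key in ngrams(signature, self.n):
            self.table[key].append(doc_id)

    def query(self, signature, top=20):
        """Each hit on a key is one vote; return (doc_id, votes) ranked by votes."""
        votes = Counter()
        for key in ngrams(signature, self.n):
            for doc_id in self.table.get(key, ()):
                votes[doc_id] += 1
        return votes.most_common(top)


index = SignatureIndex(n=3)
index.add("doc1", "abcdefgh")
index.add("doc2", "hgfedcba")
print(index.query("abcdefgh"))  # exact copy: all 6 keys vote for doc1
print(index.query("abcdXfgh"))  # one substituted code: only 3 keys still match
```

In this toy run a single substituted code changes exactly n = 3 keys, so the corrupted
query still collects votes from the remaining keys, which is the robustness property
the n-gram scheme is meant to provide.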

Clearly, a number of additional issues should be addressed in developing a system
that satisfies the criteria set forth above. These include:


•  use of global classifiers - number of pages, page component statistics, etc. as
first-level filters to reduce the duplicate search space.

•  choice of a shape code alphabet - selection of features to incorporate into the
signature which provide maximum discrimination.

•  extraction of features - how to select the signature in the image of a document.

•  database organization and indexing - how to create efficient ways to index into large
collections.

•  verification of candidate duplicates

All of these issues will be addressed in the design phase.

2.3  Related  Work 

The detection of duplicate or near-duplicate documents has been a problem of interest
for some time in many fields, but has not been addressed until recently for collections of
images. Some example domains include Education, for detection of plagiarism [5];
Publishing, for detection of unauthorized copies [3, 6]; Databases, for maintaining database
integrity; Information Retrieval, for information filtering [9]; and in the USENIX
community, for detecting duplicate files [10].

In the document community, most of the work on identifying similar documents has
been done using either ASCII documents, or “water-marked” electronic representations.
Much less work has been done with document images. A notable exception is Hull [2],
who describes a method for matching documents which have the same character content
but which may have been reformatted or distorted prior to re-imaging (content-variant
documents). Hull’s approach represents each document by a set of robust local features
which can be used to hash into a database of descriptors. The features in both the query
example and the database must be invariant to geometric distortions; by extracting
multiple descriptors from each document, they can also be made robust to errors in
feature extraction. The measure of similarity is simply the number of features the query
document and the database instance have in common. Experiments were performed using
as features the character counts for each word in short sequences of words; this provided
a set of simple yet robust features that was adequate for small databases. With as few
as ten features, 100% accuracy was obtained for a clean query string and a small clean
database. Unfortunately, as the number of documents grows, the character count metric
becomes less discriminatory.
3 System Feasibility


3.1  Theoretical  Analysis 

Before implementing and testing our approach on real data, we performed a theoretical
analysis to see if our design is realistic and if it is robust to errors in signature extraction.
The parameters that were varied in our analysis included system-dependent variables
such as file size limitations of the operating system, disk access time and disk transfer
rate; database variables such as the number of documents, the size of the index table
and the average size of the documents; and algorithm variables such as the size of the
signature alphabet, the size of the signature and the key or n-gram size.

The analysis yielded qualitative estimates of the expected size of the database, the
computational requirements for matching signatures, and the number of missed and false
duplicate detections as functions of the database size. It was found that the system could
be implemented with generally available hardware.

The details of the analysis are given in Appendix A.

3.2  Simulation  Analysis 

To demonstrate the technical feasibility of our approach to coding, indexing and retrieval,
we performed several experiments using ideal and corrupted shape code data. A simulator
was developed which allowed us to explore a variety of coding, indexing and database
organization scenarios, without the need to address feature extraction issues. The goal
was to show that signatures can be obtained which are unique enough to be used for
indexing and that the database of indexes scales appropriately.

Our simulator takes as input ideal ASCII text and maps the characters
deterministically into appropriate shape codes, thus simulating perfect feature extraction
(see Figure 3). Using the resulting signature, we can explore various indexing options, and
test the uniqueness of the signatures on large databases. Since text databases are widely
available, a large-scale system can be simulated at relatively low cost.
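
The deterministic character-to-shape-code mapping performed by the simulator might look
like the following Python sketch. The five-code alphabet and the character classes used
here are assumptions made for illustration only; they are not the eight-code alphabet of
Table 1, and the function names are hypothetical.

```python
import string

# Hypothetical character classes; a real alphabet would follow Table 1.
ASCENDERS = set("bdfhklt") | set(string.ascii_uppercase) | set(string.digits)
DESCENDERS = set("gpqy")
MULTI_COMPONENT = set("ij")
X_HEIGHT = set("acemnorsuvwxz")


def shape_code(ch):
    """Map one character to a single-letter shape code (assumed alphabet)."""
    if ch in MULTI_COMPONENT:
        return "i"  # symbols made of more than one connected component
    if ch in ASCENDERS:
        return "A"  # rises above the x-line
    if ch in DESCENDERS:
        return "g"  # drops below the base-line
    if ch in X_HEIGHT:
        return "x"  # confined between base-line and x-line
    return "."      # punctuation and anything else


def text_to_signature(text):
    """Simulate perfect feature extraction: one code per non-blank character."""
    return "".join(shape_code(c) for c in text if not c.isspace())


print(text_to_signature("The dog"))  # AAxAxg
```

Because the mapping is a pure function of the input text, any large ASCII corpus can be
turned into shape code signatures without touching an image.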


Figure 3: Simulator overview


3.2.1  Experiment  1 

In our first experiment, we used a small database so that we could examine and track
individual signatures through the system. We extracted 5000 lines of text from an
electronic version of the Wall Street Journal, each of which contained at least 50 symbols.
Each line was treated as an independent document so the database would represent 5000
document images (i.e., 5000 signatures of length ≥ 50). We chose a shape code alphabet
of size 8 (shown in Table 1), and a key length of 5 (i.e., a set of 5-grams was generated
from each signature). An example of a text line is shown in Figure 4 along with its shape
code signature and some of the defined index keys. The keys were then stored in the
database.
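
As a rough back-of-the-envelope check on these parameters, the following Python sketch
counts the keys produced per signature and the number of distinct keys available. The
function name is hypothetical, and the average-load figure assumes keys are uniformly
distributed, which English text certainly is not; it is only an order-of-magnitude guide.

```python
def key_statistics(num_documents, signature_length=50, key_length=5, alphabet_size=8):
    """Count index keys for a database of fixed-length signatures."""
    keys_per_signature = signature_length - key_length + 1
    distinct_keys = alphabet_size ** key_length
    total_entries = num_documents * keys_per_signature
    return {
        "keys_per_signature": keys_per_signature,
        "distinct_keys": distinct_keys,
        "total_entries": total_entries,
        # Assumes a uniform key distribution (an idealization).
        "avg_entries_per_key": total_entries / distinct_keys,
    }


stats = key_statistics(5000)
print(stats["keys_per_signature"])  # 46
print(stats["distinct_keys"])       # 32768
```

A 50-symbol signature therefore yields 46 five-grams, which appears consistent with the
score of 46 listed first for line 831 in Table 2.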

Our  indexing  experiments  were  divided  into  two  parts.  Part  I  was  designed  to  examine 
the  distribution  of  scores  for  known  duplicates,  corrupted  duplicates  and  non-duplicates 
matched  against  the  database.  This  would  give  us  an  idea  of  the  uniqueness  of  the 
signature  keys.  Part  II  was  designed  to  look  at  the  distribution  of  the  ranks  of  retrieved 
duplicates  in  the  top  20  positions.  This  would  allow  us  to  judge  the  effect  of  noise 
on  individual  instances.  To  address  the  robustness  of  feature  extraction,  we  further 
enhanced  the  simulator  by  building  a  noise  and  degradation  model  into  the  system. 
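
The noise and degradation model itself is not specified in this section. Purely as an
illustration, a fixed number of errors could be injected into a signature as random
substitutions, insertions and deletions, as in the Python sketch below. The choice of
error types, the uniform sampling, the eight-letter placeholder alphabet and the function
name `corrupt` are all assumptions.

```python
import random


def corrupt(signature, num_errors, alphabet="ABCDEFGH", rng=None):
    """Apply num_errors random edit operations to a shape code string."""
    rng = rng or random.Random()
    codes = list(signature)
    for _ in range(num_errors):
        op = rng.choice(("substitute", "insert", "delete"))
        if op == "substitute" and codes:
            pos = rng.randrange(len(codes))
            # Replace the code with a different symbol from the alphabet.
            codes[pos] = rng.choice([c for c in alphabet if c != codes[pos]])
        elif op == "insert":
            codes.insert(rng.randrange(len(codes) + 1), rng.choice(alphabet))
        elif codes:
            # Only a delete on a non-empty signature reaches this branch.
            del codes[rng.randrange(len(codes))]
    return "".join(codes)


clean = "ABCDEFGH" * 6
noisy = corrupt(clean, 10, rng=random.Random(7))
print(len(clean), len(noisy))
```

Passing a seeded `random.Random` makes a corrupted query reproducible, which is
convenient when the same degraded signature has to be matched under several index
configurations.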
Table 1: Table of shape codes and symbols to which they apply

Table 2: Scores of line 831 in [candidate (score)] format.

[Row and column alignment was lost in extraction; the legible entries are listed in
source order, with unreadable cells marked [illegible].]

831(46) 3984(16) 839(14) 3734(14) 834(14) 3752(13) 828(13) 3990(13) 3749(12)
[illegible] 708(10) 1029(10) 2788(14) 2789(13) 4474(10) 4498(10) 834(9)
[illegible] 1990(7) 3909(6) 541(5) 831(9) 790(6) 839(6) 984(6) 1888(6)
2394(6) 2397(6) 2402(6) 2589(6) [illegible] 387(3) 1421(3) 2066(3) 2990(3)

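Table 3: Scores of line 5416, which was not in our database, in [candidate (score)] format.

[Row and column alignment was lost in extraction; the legible entries are listed in
source order, with unreadable cells marked [illegible]. The bare values 5, 10, 15 and 20
are kept where they appear in the source.]

[illegible] 3390(10) 4598(10) 4530(9)
5 1872(6) 1959(5)
10 3390(8) 2684(6) 1823(6) 4530(6) 3373(5) 62(5) 4489(5) 1636(5) 4546(5)
15 2726(5) 2838(5) 39[?]7(4) 732(4) 941(4) 159[?](4) 2027(4) 2097(4) 2498(4)
20 [illegible] 4727(8) [illegible] 3512(7)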

Table 3 shows the match scores for such a text line. A combination of low scores and
the similarities among the top scores gives us an indication that this candidate signature
is not a duplicate.

It is interesting to note that the scores for a text line that is not in the database (i.e., a
non-duplicate text line) are higher than those for a text line that is in the database when
15 errors were introduced. This is due to the fact that corrupted text begins to form
shape code keys that could not possibly correspond to words which appear in the English
language, and thus it receives a lower score when matched against a real database. The
keys in the non-duplicate text line, however, still correspond to valid keys, and receive
higher scores. This gives us a quantitative measure of the similarity between signatures
taken from non-duplicate, English text.

In Part II of the first experiment, we randomly chose 100 valid duplicate signatures,
matched them against the 5000 signatures in the database, and recorded how often the
correct match was ranked in the top one, two, five, and ten positions. The results
are shown in Table 4 for candidates corrupted with 0, 5, 10, 15 and 20 errors. We see
that the correct match was consistently in the top position when there were 10 or fewer
errors.

In practice, the variation between two documents which are image-variant duplicates
of each other is typically due to factors such as notes, photocopying and aging, and to
characteristics of scanning processes including resolution, density and skew. We therefore
expect few “differences” in the shape codes of duplicate documents, so the introduction
of 20 errors is more than sufficient.

Table 4: Top duplicate candidates in 100 queries

Added Errors   Top 1   Top 2   Top 5   Top 10   Top 20
0              100     100     100     100      100
5              100     100     100     100      100
10             100     100     100     100      100
15             61      58      69      77       100
20             17      21      24      30       100

Table 5: Number of duplicate candidates detected in 2500 queries from a pool of one
million documents.

Added Errors   Top 1   Top 2   Top 5   Top 10   Top 20
0              2416    2443    2458    2465     2467
5              2379    2419    2447    2457     2463
8              2059    2202    2327    2390     2431
10             1403    1678    1905    2039     2181

3.2.2  Experiment  2 

Our second set of experiments involved a much larger number of signatures. We used
text data similar to the data used in Experiment 1, but we used the text to create a test
database of one million signatures. We then chose 5000 signatures to simulate incoming
documents: 2,500 non-duplicate signatures were chosen from a different corpus and 2,500
random duplicate signatures were chosen from the Wall Street Journal corpus.

We first ran the signatures of duplicate documents through the matching process,
which allowed us to study the distribution of matching scores. In evaluating the results,
we observed that some matches produced scores greater than the expected maximum of
50. This is because it is possible for a key to occur more than once in a signature. We
also noted that some signature lines occurred more than once in the database, resulting
in false duplicates. The first line of Table 5 shows the number of detected duplicates out
of 2500 ranked in the top 1, 2, 5, 10 and 20 positions¹.

¹ Some of the lines in the original database occur more than 20 times, so the “true”
duplicate may not appear in the top 20. That is why even in the 0 error case, the 2500
documents were not all detected.

Figure 5: Percentage of duplicate documents identified in the top n candidate documents
retrieved.

Next, we ran corrupted signatures of duplicate documents through the matching
process, perturbing them by introducing fixed numbers of errors (5, 8 and 10). The
numbers of duplicate documents ranked in the top 1, 2, 5, 10, and 20 positions for these
levels of errors are shown in the bottom three lines of Table 5. Figure 5 shows the
percentage of detected duplicates² as a function of the number of documents retrieved
for each of the error levels. We see that even when there are 10 errors in a signature of
size 50, we can detect duplicates more than 84 percent of the time.
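
The percentages plotted in Figure 5 follow directly from the counts in Table 5: each
count is divided by the 2500 duplicate queries. The short Python sketch below repeats
that arithmetic; the variable and function names are hypothetical, and the counts are
copied from Table 5.

```python
CUTOFFS = [1, 2, 5, 10, 20]

# Counts from Table 5: added errors -> duplicates found in the top 1, 2, 5, 10, 20.
TABLE_5 = {
    0: [2416, 2443, 2458, 2465, 2467],
    5: [2379, 2419, 2447, 2457, 2463],
    8: [2059, 2202, 2327, 2390, 2431],
    10: [1403, 1678, 1905, 2039, 2181],
}


def detection_percent(table, total_duplicates=2500):
    """Convert detection counts into percentages of all duplicate queries."""
    return {
        errors: {cutoff: 100.0 * hits / total_duplicates
                 for cutoff, hits in zip(CUTOFFS, row)}
        for errors, row in table.items()
    }


percent = detection_percent(TABLE_5)
for errors, row in percent.items():
    print(errors, [round(row[c], 1) for c in CUTOFFS])
```

For example, with 10 added errors, 2181 of the 2500 duplicates appear among the top 20
candidates, which is about 87 percent.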

4  Implementation 

Having tested the index features and the robustness of the signature matching process,
the remaining task was to implement and test the line extraction and signature coding
processes using image data. With degraded documents, the most critical aspect of the
system is its ability to extract the same representative line from multiple instances of
a document. Figure 6 shows a candidate document and a representative line extracted
from it.

² The “recall” of a retrieval system can be defined as the number of relevant documents
which are retrieved divided by the number of relevant documents in the database. We have
plotted the number of duplicates identified, divided by the total number of duplicates.

4.1  Representative  Line  Extraction 

The  representative  line  extracted  from  each  image  is  a  text  line  which  has  a  sufficient 
number  of  characters  to  be  used  as  a  signature.  For  efficiency  we  divide  the  page  image 
horizontally  into  thin  “zones”  and  analyze  the  zones  in  vertical  order  from  top  to  bottom. 
When  a  representative  line  is  found,  we  do  not  process  any  portion  of  the  image  below 
it. 

To begin, we apply a single-pass connected component algorithm to identify components
within the zone of interest. To deal with noise, we want to eliminate components
which appear to result from copier degradation or graphics. We use conservative thresholds
on the component size to eliminate components whose sizes are less than 7pts or
greater than 14pts. Although this may result in the loss of some punctuation marks
and other small symbols, they are likely to be lost in both the original and duplicate
documents, and so should not affect the match.
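
The size thresholds above can be illustrated with a minimal sketch. It assumes that
"size" means the bounding-box height of a component and that the scan resolution (dpi)
is known; the `Component` type and the function name are hypothetical and are not taken
from the implemented system. The only fixed fact used is that one typographic point is
1/72 inch.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # one typographic point is 1/72 inch


@dataclass
class Component:
    """Bounding box of a connected component, in pixels (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def filter_by_size(components, dpi, min_pts=7.0, max_pts=14.0):
    """Keep components whose height lies between min_pts and max_pts.

    Assumption: the component "size" is its bounding-box height.
    """
    min_px = min_pts * dpi / POINTS_PER_INCH
    max_px = max_pts * dpi / POINTS_PER_INCH
    return [c for c in components if min_px <= c.height <= max_px]
```

At 300 dpi the thresholds correspond to roughly 29 and 58 pixels, so specks and large
graphics are both discarded.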

Next, the symbols are grouped into “words” using a smearing algorithm whose distance
is a function of the average component width. Thresholds are applied to the height
and width of the words to eliminate words which would not likely contribute to a unique
signature.

Using a second smearing process, we group the words into a line. After line groups
are formed, we begin at the top of the zone, and search for a valid line. A valid line is
a line which 1) has a sufficient number of characters and 2) contains both ascenders and
descenders. The first restriction ensures that we have a signature which is sufficiently
long, and the second restriction allows us to avoid lines such as titles which may consist
entirely of capital letters, as well as lines consisting entirely of numeric data. If no valid
lines are found in the current zone, the next zone is considered and the process is repeated
until a valid signature line is found.

In order to avoid being fooled by running heads or other information which is not
unique to a given document, we skip the first two valid lines, and consider the third valid
line as the signature for the page. If this is done consistently, it will provide us with a
meaningful signature to index the document.

Figure 6: The representative line chosen for a document with mixed text and graphics
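
The selection rule described in this section (scan zones top to bottom, test each line
for validity, skip the first two valid lines and keep the third) can be sketched as
follows. This is only an illustration under stated assumptions: the `TextLine` type, the
function names and the default minimum of 50 characters are hypothetical, and the line
grouping itself is assumed to have been done already.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """Summary of one grouped text line (hypothetical type)."""
    num_chars: int
    has_ascender: bool
    has_descender: bool


def is_valid(line, min_chars):
    """A valid line is long enough and contains both ascenders and descenders."""
    return line.num_chars >= min_chars and line.has_ascender and line.has_descender


def pick_signature_line(zones, min_chars=50, skip=2):
    """Return the (skip + 1)-th valid line, scanning zones from top to bottom.

    zones: iterable of lists of TextLine, ordered top to bottom.
    Returns None when the page does not contain enough valid lines.
    """
    seen = 0
    for zone in zones:
        for line in zone:
            if is_valid(line, min_chars):
                if seen == skip:
                    return line  # nothing below this line needs processing
                seen += 1
    return None
```

Returning as soon as the line is found mirrors the stated efficiency goal of not
processing any portion of the image below the representative line.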


4.2  Shape  Code  Extraction 

Once a valid line is found, the next step is to extract its signature. The presence of
factors such as uncorrected skew may cause difficulties if shape codes are extracted at
the line level, so we extract the shape codes by considering a word at a time.

We begin by roughly classifying the symbols in a word into three groups by height:
small symbols like punctuation, medium-height symbols such as ‘a’ or ‘c’, and tall symbols
such as ascenders, descenders, parentheses, etc. We use a medium-height symbol from the
middle of the line to define an xheight hypothesis, using the bottom of the symbol as
the baseline and its top as the xline (see Figure 7). Starting from this seed symbol, we
encode the line outward in both directions. The idea is to adaptively use each [...]
and its new neighbors to predict and refine the position of the baseline on each [...].
If the symbol is a punctuation mark (i.e. it does not span the region between the baseline
and the xline, or it extends into both the ascender and descender regions), [...]
but do not adjust the x or baselines. If the character is an ascender, we [...]
adjust the baseline to the bottom of the character. If it is a descender, we [...]
adjust the xline to its top. Finally, if it covers only the xheight region, we [...]
the xline and the baseline.

After all the characters in the word are classified, holes are identified and [...]
are refined if necessary. Figure 8 shows an example of the results of coding.

Figure 7: Example xheight hypothesis, xline and baseline

Figure 8: Shape coding results for part of the line shown in Figure 6.

The shape coding process is very accurate for clean data which is not significantly
skewed. A problem arises when many characters touch; the system may skip an otherwise
valid line. This tends to occur when documents are photocopied repeatedly, [...]
thresholded, or are significantly degraded physically. For example, if the original [...]
printed with a laser printer, and our duplicate candidate is a third-generation [...],
enough symbols may touch to put us below the signature length limit. But as [...]
in Section 3.2, even when there are 10-15 shape code errors, we can still detect [...]
Figure 9: User interface.


4.3  Interface 

A user interface has been implemented in Java to provide users with an easy way to
interact with a database (Figure 9). It allows the user to select a database and verify
either a single document or a group of documents. The retrieved documents are ranked
and presented for confirmation along with a quantitative measure of similarity. If the
candidate document is not found in the retrieved set, the user can add it to the database.
Otherwise, the user can select a retrieved document and mark it as a duplicate, mark it
as “similar” or mark it as unknown. Log files are generated to track documents through
processing and indexing.

The computational requirements for the system are reasonable. The image processing,
including thumbnail generation, takes about 7 seconds per document, indexing into a
database of one million documents takes another 2 seconds, and displaying 20 candidates
takes another 2 seconds with unoptimized code on a SPARC Ultra-1. The first two
phases can be performed offline in batch mode.
Table 6: Results of matching 307 duplicate document images against a database of
approximately 1000 documents.

                                     Top 1   Top 2   Top 6   Top 10   Top 20
Number of Correct Identifications    286     296     298     302      307
% of Total                           93.2    96.4    97.1    98.4     100.0

5  Experiments  with  real  images 

For our experiments with real document images, we used the University of Washington
Document Image Databases [4] in which the data consist primarily of technical journal
image pages scanned from first and third generation photocopies of the original documents,
and memos scanned directly from original documents.

In an initial experiment, 1035 image documents were inserted into a database. As
each new document was added, the top-ranked 20 documents were returned along with
their match scores. As expected, the average match score was low (20%). The average
difference between the match scores of the top two matched database documents was
also low (about 4%); thus no retrieved document stood out as significantly more similar
to the candidate document.

We then tested 307 duplicate documents which were third-generation photocopies of
the originals^. Some of these copies were significantly degraded. For these duplicate
documents the verification procedure was followed, but without adding the candidate
documents to the database. The top 20 database matches were determined for each
candidate document, along with their match scores. The average score for the top match
was over 75% and the average difference between the scores of the top two matches was
over 48%. The average match score for the second-ranked match was only about 20%,
similar to the scores of the top-ranked non-duplicates. Clearly, the duplicates tended to
have significantly higher match scores than the second-ranked non-duplicate documents.
Table 6 shows the distribution of rankings for the 307 duplicates.

All of the errors in the top-ranked documents were due to significant numbers of
merged characters which resulted in missing the representative line completely. We are
testing an improved character segmentation scheme which is not based entirely on white
space, but also uses character width statistics. We anticipate that this approach will
reduce the errors by as much as 75%.

^These photocopies were also present on the UWASH CDROM.

It should be pointed out that our experiments have used only the shape code features
for document identification, and have not considered other features such as the number
of pages in the document, the number of lines on a page, or the density of the page, for
example. A more advanced system could also make use of such features.

6  Discussion  and  Conclusions 

The  problem  of  duplicate  document  image  detection  is  of  great  importance,  not  only 
within  a  single  database,  but  also  for  image  query  engines  of  the  future  which  operate 
across  multiple  databases. 

If we can reduce the number of duplicate document candidates to a manageable size,
more refined algorithms that directly compare the images can be used, or thumbnails of
the top 20 candidates can be rapidly presented to an operator to verify that a duplicate
exists. One advantage of not using structural information for indexing is that non-duplicate
candidates tend to be visually very different from the query document and can
be eliminated rapidly by the operator.

The novel approach to the problem of duplicate document detection described in this
paper shows great promise, as demonstrated both by the simulation results on a million
documents (Section 3.2) and the experimental results on a small image database (Section
5). We can deal more robustly with small image distortions and have the advantage over
competing approaches that we do not need any content-level information, either a priori
or as part of the analysis process.

It  appears  that  a  system  could  be  developed  within  current  limitations  on  system 
resources  which  would  provide  a  cost-effective  solution  to  the  problem.  It  is  estimated 
that  in  some  large  applications,  as  much  as  25%  of  the  cost  could  be  saved  by  identifying 
duplicate  documents. 
A  Feasibility Analysis


This appendix presents a theoretical feasibility analysis demonstrating that the system
design is realistic and that it is robust to anticipated errors in the signature extraction.
The analysis assumes we have extracted a candidate signature. The signature is a vector
of feature values used to represent the document, and a key, possibly resulting from
a partitioning of the signature, is used to index into the database. The parameters
listed below will be used in the feasibility and performance analysis of the algorithm. A
majority of the parameters reflect physical constraints of the system and are necessary
to explore scalability.

System-Dependent Parameters

F : Maximum allowable number of files in the operating system.
S : Maximum allowable size for a single file in the operating system.
K : Main memory size used for processing buckets.
$t_d$ : Disk access time in sec.
$t_m$ : Memory access time in sec.
$t_t$ : Disk transfer rate in bytes/sec.

Algorithm-Dependent Parameters

N : Number of documents.
$N_f$ : Maximum number of files for the index table.
$S_f$ : Maximum file size for storing b buckets.
E : Size of index table entries in bytes.
d : Average size of documents in bytes.
f : Number of buckets that will be kept in the main memory.
b : Number of index table buckets that will be stored in a file in main memory.

Independent Variables

a : Size of alphabet used to construct the signature.
m : Size of the signature.
w : Window size.
k : Number of keys for a given signature size ($m - w + 1$).


A.1  Analysis of Indexing


The matching algorithm relies on an index structure that is generated as documents
are added to the database. Each signature is partitioned into equal-sized overlapping
windows of size w, to be used as index keys. This partitioning provides robustness in the
sense that errors in the signature will only be propagated within the window, but the
smaller key size results in a less unique set. For a signature of size m, $k = (m - w + 1)$
possible keys (Figure 2 in the main text) must be indexed. The index table has on the
average $k \times (N/a^w)$ entries per bucket with each entry being the identification of a
document which contains that key. Figure 10 shows the index table structure.

An array with a maximum size of $\min\{N,\ k \times (N/a^w)\}$ keeps a count of the documents
which contain keys that have matched the input key. When the document’s keys are
indexed, a counter is incremented for each document which contains that key. The most
frequently occurring documents are then identified as candidate duplicates. A further
level of refinement can be achieved by using a more elaborate matching algorithm that
includes the positional matches between the keys, but this is not covered in this analysis.
The search time for the matching operation can be generalized as follows†:

$$
\text{Worst case search time} = \text{Bucket search} + \text{Disk operations}
= k \times \left( \left( k \times \frac{N}{a^w} \times t_m \right) + \left( t_d + \frac{b \times E \times k \times \frac{N}{a^w}}{t_t} \right) \right)
\tag{1}
$$

Figure 10: The index table.

Note that every document signature has k index terms, and each bucket has an
average of $(k \times N/a^w)$ entries. Here, the disk access time ($t_d$) is multiplied by the
number of windows, since in the worst case, every indexed bucket is in a separate file.
The transfer time ($t_t$) is the time required for reading the buckets from the disk, and
since disk operations are more expensive than memory operations, the number of buckets
that are stored in a single file should be optimized to decrease disk access time. We can
choose b such that the file size for the b buckets is not greater than the main memory
size ($f < b$), thus minimizing the disk access time.

These parameters should also be chosen not to exceed the limitations of the operating
system. There is an upper limit on the maximum allowable number of files (F) and on
the maximum allowable file size (S) in every operating system. Since we have a large
amount of data, the index table structure should be within these limits even if we have
to partition the index table. The maximum number of files to store index buckets is

†If the bucket size stored in the file is greater than K, then to search the entire file requires multiple
disk accesses. Here the disk access time ($t_d$) is assumed to be 9 ms. The memory access time ($t_m$) is
100 ns and the transfer rate ($t_t$) is assumed to be 15 MB per sec.


$$ N_f = \frac{a^w}{b} \le F \tag{2} $$

(the total number of buckets ($a^w$) divided by the number of buckets in a single file
(b)) where F is the operating system restriction on the maximum number of files. The
maximum file size is then

$$ S_f = \left( k \times \frac{N}{a^w} \right) \times b \times E \le S \tag{3} $$

or the number of entries per bucket multiplied by the number of buckets in each file and
the entry size in bytes. Since each bucket has a maximum of $k \times N/a^w$ entries, if we keep f
buckets in memory, then the following constraint should be satisfied:

$$ \left( f \times \frac{k \times N}{a^w} \times E \right) \le K. \tag{4} $$

The construction of the index table should satisfy (2), (3) and (4). The overhead associated
with the index table can be estimated as

$$ \text{Overhead} = \frac{k \times E}{d} \tag{5} $$

Note that equation (5) is independent of N, the number of documents in the database.
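
The windowed indexing and vote counting described in this section can be illustrated
with a small in-memory sketch. It is not the disk-based bucket structure analyzed here:
the class and function names are hypothetical, buckets are held in a Python dictionary,
and signatures are represented as plain strings of shape-code symbols.

```python
from collections import Counter, defaultdict


def window_keys(signature, w):
    """Return the k = m - w + 1 overlapping windows of length w."""
    return [signature[i:i + w] for i in range(len(signature) - w + 1)]


class SignatureIndex:
    """Toy in-memory index: one bucket (list of document ids) per key."""

    def __init__(self, w):
        self.w = w
        self.buckets = defaultdict(list)

    def add(self, doc_id, signature):
        """Index a document under every window of its signature."""
        for key in window_keys(signature, self.w):
            self.buckets[key].append(doc_id)

    def query(self, signature, top_n=20):
        """Count, per document, how many of the query's keys hit its entries."""
        votes = Counter()
        for key in window_keys(signature, self.w):
            for doc_id in self.buckets.get(key, ()):
                votes[doc_id] += 1
        return votes.most_common(top_n)


index = SignatureIndex(w=3)
index.add("a", "AxgxiAxx")
index.add("b", "xxxxgggg")
# Query with one substituted symbol (4th position) relative to document "a".
print(index.query("AxgAiAxx"))  # [('a', 3)]
```

In this toy case m = 8 and w = 3, so k = 6; the single substitution spoils the three
windows that contain it and the remaining three keys still vote for document "a".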

A.2  Performance Analysis

To characterize the performance of a given system we must first formulate several
probabilistic models and define a design criterion. This is done by using hypothesis testing
and deriving an appropriate distribution function such that specific performance measures
can be computed. For a hypothesis test consider two events, the null event and the
alternative event. The Null event, $H_0$, is that there is no duplicate for a given document,
and the Alternative event, $H_1$, is that there is a duplicate.

$$
\begin{aligned}
H_0 &: \text{Null Hypothesis} & P(\eta \mid H_0) \\
H_1 &: \text{Alternative Hypothesis} & P(\eta \mid H_1)
\end{aligned}
\tag{6}
$$

Associated with each hypothesis is a probability measure $\eta$, the probability of detection.
For detection, a threshold is used to determine which hypothesis is valid. In this case
$\eta$ represents a matching score assigned to an observation. By choosing an appropriate
threshold $\eta_t$ we make the decision that if $\eta > \eta_t$ we accept the Alternative hypothesis,
and if $\eta < \eta_t$ we accept the Null hypothesis. The probability of detection and the
probability of false alarm capture the performance of this detector:

$$
\begin{aligned}
P_D(\eta_t) &= P(\eta > \eta_t \mid H_1) \\
P_{FA}(\eta_t) &= P(\eta > \eta_t \mid H_0)
\end{aligned}
\tag{7}
$$

These numbers will be used as operating specifications given various parameters, but to
fully specify $P_D$ and $P_{FA}$ we need to analyze or make assumptions about the data and
the matching processes.

Given a signature of size m from an alphabet of size a and an observation (or index
key) of size w, define $x_i$ to be the $i$th symbol and $y_j$ to be the $j$th index key. With
$x_i \in X_i = \{0,1\}^{\log_2(a)}$ it is easy to see that all the $x_i$’s are independent. In fact

$$
P(X = x) = \prod_{i=1}^{m} P(X_i = x_i) \quad \text{for } x \in X = \prod_{i=1}^{m} X_i = \{0,1\}^{m \log_2(a)}
\tag{8}
$$

Let us define $y_i = \{x_i, x_{i+1}, \ldots, x_{i+w-1}\}$; $y_i$ is clearly dependent on $y_{i-w+1}$ to $y_{i+w-1}$. As
described in the previous sections, the $y_i$’s will be used for indexing into a signature
database and the number of hits determines $\eta$. When indexing, however, there is no
dependence on the order of the observed $y_i$’s, and in a hypothetical situation two documents
could have large numbers of hits resulting from matching $y_i$’s out of order. Smaller
keys would increase this effect and larger keys would decrease it. In most cases, however,
we can argue that the $y_i$ values used to calculate $\eta$ are independent. We shall assume
that it is equally likely (with probability $1/a^w$) to observe any $y_i$.

Given these conditions, a signature contains a set of features $y_i$ for $i = 1, 2, \ldots, k$.
Since each $y_i$ is now treated as an independent observation and we have k observations,
we can compute the probability of $\eta$ matches out of k trials to be binomial, $b[k, 1/a^w]$:

$$
P(\eta) = \binom{k}{\eta} \left( \frac{1}{a^w} \right)^{\eta} \left( 1 - \frac{1}{a^w} \right)^{k - \eta}
\tag{9}
$$

Equation (9) is the probability of the number of matches given the Null hypothesis,
$P(\eta \mid H_0)$. This probability measures the relationship between different signatures and
is relatively easy to understand and compute. The assumption that the occurrences of
keys are equally likely, however, is unrealistic. For example, it is unrealistic to have all
descenders in a window observation. We therefore attempt to obtain realistic probabilities
of key occurrences empirically and revise the probability distribution function of the
Null hypothesis. The probability measure for the Alternative hypothesis, however, is
not so trivial.

To derive the probability distribution of the Alternative hypothesis, we must consider
what realistic errors may be observed between a candidate signature and its corresponding
entry in the database. The discrepancies between observed and recorded signatures can
be characterized as insertions, deletions, or substitutions of shape codes. In general,
errors increase with degradation, so we assume that the probability of observing i errors
is a decaying exponential function, defined by a decay rate $\beta$. The general form of this
function is $P(i) = P(0)e^{-\beta i}$, where $P(0)$ is calculated given $\sum_{i=0}^{m} P(i) = 1$. Calculating
$P(0)$ and formulating the distribution function we get

$$
P(i) = P(0)e^{-\beta i} = \frac{1 - e^{-\beta}}{1 - e^{-\beta(m+1)}}\, e^{-\beta i} \quad \text{for } i = 0, 1, \ldots, m.
\tag{10}
$$

This equation is approximate since this probability measure depends on the accuracy of
identifying shape codes, the statistics of shape codes, and the similarity of duplicates
found in the database. It is, however, impossible to characterize the statistical nature of
these measures, so we have simplified the expression to an exponential function. In order
to formulate the probability distribution of the Alternative hypothesis we need to study
the relationship between the number of errors and the matching score. In the case where
a single error occurs, the error is propagated to w of the k index entries. This results
in a matching score of $k - w$ out of a maximum of k. Although the errors might occur
at either end of the signature, and insertions and deletions may change the matching
score, these occurrences are statistically insignificant and offset each other. For multiple
errors, the worst-case scenario is when errors occur w shapes away from each other so the
errors propagate to the most keys. For this worst-case scenario, i errors will translate to
a maximum matching score of

$$
\eta_{worst} =
\begin{cases}
k - iw & \text{for } i = 0, 1, [\ldots] \\
0 & \text{otherwise}
\end{cases}
\tag{11}
$$

For the best-case scenario, the errors could occur next to each other, yielding a maximum
matching score of

$$ \eta_{best} = [\ldots] \quad \text{for } i = 0, 1, [\ldots] \tag{12} $$

The probability that worst-case or best-case errors occur is clearly dependent on the
number of errors. This is true since, by definition, $i = 1$ results in a lower bound and
$i = m$ results in an upper bound on the matching score. It is also true that the score
probabilities change as a power function of the number of errors, and their analysis is
extremely complex. Therefore, for simplicity, we will assume that the matching score is
an average of the worst and best case scenarios:

$$ \eta(i) = \frac{\eta_{worst} + \eta_{best}}{2} = [\ldots] \tag{13} $$

Using Equations (10) and (13), the probability distribution function of the Alternative
hypothesis is fully specified. We are now able to calculate the probabilities of detection
and false alarm.

Given a threshold $\eta_t$, any document with a higher score is identified as a duplicate, as
shown in equation (7). We must now apply this criterion to the probability distribution
functions of the Null and Alternative hypotheses to get the probability of false alarm and
probability of detection, respectively. Equation (9) shows the distribution function of the
Null hypothesis, and summation over $\eta = \eta_t, \ldots, k$ gives the probability of false alarm:

$$
P_{FA} = P(\eta > \eta_t \mid H_0) = \sum_{\eta = \eta_t}^{k} \binom{k}{\eta} \left( \frac{1}{a^w} \right)^{\eta} \left( 1 - \frac{1}{a^w} \right)^{k - \eta}
\tag{14}
$$
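
The false-alarm probability described above is a binomial tail sum and can be evaluated
directly. The sketch below assumes the uniform-key model of this section, taking each key
to match with probability 1/a^w (this value is an assumption of the sketch); the function
names are hypothetical.

```python
from math import comb


def null_pmf(eta, k, a, w):
    """Probability of exactly eta key matches out of k under the Null hypothesis.

    Assumption: keys are independent and uniformly likely, so p = 1 / a**w.
    """
    p = 1.0 / a ** w
    return comb(k, eta) * p ** eta * (1.0 - p) ** (k - eta)


def prob_false_alarm(eta_t, k, a, w):
    """Sum the Null distribution over eta = eta_t, ..., k."""
    return sum(null_pmf(eta, k, a, w) for eta in range(eta_t, k + 1))


if __name__ == "__main__":
    # Parameters used in the operational scenario: m = 50, w = 5, a = 8, k = 46.
    for threshold in (1, 2, 3):
        print(threshold, prob_false_alarm(threshold, k=46, a=8, w=5))
```

Because the text notes that uniform key probabilities are unrealistic, these numbers
should be read as an idealized bound rather than a measured rate.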

In order to calculate the probability of detection we need to consider the inverse of
equation (13) to be used in equation (10). Using $\eta_t$ and solving for i gives $i_{max}$, the maximum
number of allowable errors such that the effective score equals the threshold score:

$$
i_{max} =
\begin{cases}
[\ldots] \\
k - \eta_t & \text{otherwise}
\end{cases}
\tag{15}
$$
Then summing equation (10) from zero to $i_{max}$ gives the probability of detection:

$$
P_D = P(\eta > \eta_t \mid H_1) = \sum_{i=0}^{i_{max}} \frac{1 - e^{-\beta}}{1 - e^{-\beta(m+1)}}\, e^{-\beta i}
\tag{16}
$$

Figure 11: Typical ROC Curve
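
The detection probability is a partial sum of the exponential error model, normalized
over 0 to m errors. The following sketch evaluates it for a caller-supplied maximum
number of allowable errors; it does not derive that maximum from the threshold. The
function names and the decay rate used in the example are hypothetical.

```python
from math import exp


def error_pmf(i, m, beta):
    """Probability of observing i errors, P(i) = P(0) * exp(-beta * i).

    P(0) is chosen so that the probabilities for i = 0, ..., m sum to one.
    """
    p0 = (1.0 - exp(-beta)) / (1.0 - exp(-beta * (m + 1)))
    return p0 * exp(-beta * i)


def prob_detection(i_max, m, beta):
    """Sum the error distribution from zero to i_max (capped at m)."""
    return sum(error_pmf(i, m, beta) for i in range(0, min(i_max, m) + 1))


if __name__ == "__main__":
    # beta = 0.3 is an arbitrary illustrative decay rate, not a value from the paper.
    for allowed in (0, 5, 10):
        print(allowed, prob_detection(allowed, m=50, beta=0.3))
```

A larger decay rate concentrates the probability on few errors, so the same error
budget yields a higher detection probability.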

Given equations (14) and (16), it is possible to plot the probability of false alarm with
respect to the probability of detection as a function of the threshold $\eta_t$. This plot is
traditionally called the ROC (Receiver Operating Characteristic) curve; it is shown in
Figure 11 for a typical case. In the pioneering days of radar technology, microwave
engineers drew such curves for a number of signal power levels to determine the optimum
operating point. We will perform a similar task, but instead of varying signal strength
we will vary signature length, alphabet size, and window size. Figure 11 shows a diagonal
line which signifies the minimum achievable performance. The diagonal line represents
picking the Null or the Alternative hypothesis based on a coin toss; if an algorithm
performs below this line, a coin toss would perform better.
A.3  Operational Scenarios


The following example represents a possible document management scenario. Suppose N =
50 million, each document has an average of 10 pages, and each page (TIFF file) has an
average size of 100KB^, so that the size of the database is approximately 50,000 GB,
assuming it is necessary to keep all images. The index table has size 46 × 50M × 4 Byte =
8.6GB assuming N = 50M, m = 50, w = 5, and a = 8, since each document will be
indexed by k = 46 keys. This means it is not possible to store the table in main memory.
A 4 Byte representation of the document identification number can be used for the 50M
documents. Each of the 32K buckets in the index table has an average of 69K document
identification entries.

This scheme has two limitations that are imposed by the UNIX file system. The first
is that the size of the index table exceeds the size of main memory, so disk caching
is required. The second limitation is that the entire table cannot fit within a single file
(8.6 GB), so it must be divided into a number of smaller files that are within the limits
of the maximum file size imposed by UNIX.

Table 7 shows the index table size and the number of entries per bucket for different
numbers of documents. To calculate the search time we use $t_d$ = 9 msec, $t_m$ =
100 nanosec, and $t_t$ = 15 MB/sec.

N (million)   Entries per bucket   Index table size in GB   Expected search time in sec
10.0          14039                1.7                      0.64
20.0          28077                3.4                      0.87
30.0          42115                5.1                      1.10
40.0          56153                6.9                      1.33
50.0          70191                8.6                      1.56

Table 7: For m = 50, w = 5, a = 8 the index table has 32K buckets.
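
The figures in Table 7 can be recomputed with a short sketch. It rests on assumptions
that are not stated explicitly in the text but that reproduce the tabulated values:
entries per bucket are rounded up, a megabyte or gigabyte is a power of two, only the
indexed bucket (4-byte entries) is transferred for each of the k keys, and each key costs
one disk access per 2 MB of bucket data. The function names are hypothetical.

```python
from math import ceil

MB = 2 ** 20  # assumption: binary megabytes
GB = 2 ** 30  # assumption: binary gigabytes


def entries_per_bucket(n_docs, m=50, w=5, a=8):
    """Average bucket occupancy k * N / a**w, rounded up."""
    k = m - w + 1
    buckets = a ** w
    return (k * n_docs + buckets - 1) // buckets


def index_table_gb(n_docs, m=50, w=5, entry_bytes=4):
    """Total index size: k entries of entry_bytes for every document."""
    return (m - w + 1) * n_docs * entry_bytes / GB


def search_time(n_docs, m=50, w=5, a=8, entry_bytes=4,
                t_d=9e-3, t_m=100e-9, t_t=15 * MB, mem_bytes=2 * MB):
    """Estimated worst-case search time in seconds (see assumptions above)."""
    k = m - w + 1
    entries = entries_per_bucket(n_docs, m, w, a)
    bucket_bytes = entries * entry_bytes
    scan = k * entries * t_m                          # memory scan of k buckets
    seeks = k * ceil(bucket_bytes / mem_bytes) * t_d  # disk accesses
    transfer = k * bucket_bytes / t_t                 # reading k buckets from disk
    return scan + seeks + transfer


if __name__ == "__main__":
    for millions in (10, 20, 30, 40, 50):
        n = millions * 1_000_000
        print(millions, entries_per_bucket(n),
              round(index_table_gb(n), 1), round(search_time(n), 2))
```

Under these assumptions the transfer term dominates as N grows, which is why the search
time in the table rises almost linearly with the number of documents.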

Table 8 shows index table characteristics and an estimate of the running time of
the algorithm for different alphabet sizes. In the first two rows, the search time is
larger than the other values, due to the large bucket size. This requires multiple disk
accesses to search all of the buckets. However, in the other cases the whole bucket can
be stored in main memory, requiring only one disk access. We can see from Table 8
that a smaller bucket size is desired for faster search. But this can only be achieved by
increasing a, which means greater preprocessing time for extracting shape codes, and
possibly decreased robustness.


a   Number of   Entries per   Bucket size   Buckets    Bucket file   Number of   Expected search
    buckets     bucket        (MB)          per file   size (MB)     files       time (sec)
4   1024        2246094       8.6           1          8.6           1024        38.68
5   3125        736000        2.8           1          2.8           3125        12.82
6   7776        295782        1.1           1          1.1           7776        5.23
7   16807       136848        0.5           3          1.6           5603        2.64
8   32768       70191         0.3           7          1.9           4682        1.56

Table 8: Index table characteristics for w = 5, m = 50, N = 50M, K = 2MB
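
The bucket and file counts in Table 8 follow from the constraint that the buckets stored
in one file should fit in the K = 2 MB of main memory. The sketch below recomputes them
under assumptions that match the tabulated values: 4-byte entries, entries per bucket
rounded up, binary megabytes, and at least one bucket per file even when a single bucket
exceeds K. The function name is hypothetical.

```python
def file_layout(a, m=50, w=5, n_docs=50_000_000, entry_bytes=4,
                mem_bytes=2 * 2 ** 20):
    """Return (buckets, entries per bucket, buckets per file, number of files)."""
    k = m - w + 1
    buckets = a ** w
    entries = (k * n_docs + buckets - 1) // buckets       # rounded up
    bucket_bytes = entries * entry_bytes
    per_file = max(1, mem_bytes // bucket_bytes)          # as many as fit in K
    files = (buckets + per_file - 1) // per_file          # rounded up
    return buckets, entries, per_file, files


if __name__ == "__main__":
    for alphabet_size in (4, 5, 6, 7, 8):
        print(alphabet_size, file_layout(alphabet_size))
```

For a = 8 this gives 32768 buckets of 70191 entries, 7 buckets per file and 4682 files.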


Table 9 shows index table characteristics and an estimate of the running time of the
algorithm for different values of w. The number of documents is 50M, m = 50, a = 8.
The first two cases have significant search times due to the large bucket size. We
can see from Table 9 that the search time decreases as w gets larger, but a larger w
means a more error-prone system.


w   Number of   Entries per   Bucket size   Buckets    Bucket file   Number of   Expected search
    buckets     bucket        (MB)          per file   size (MB)     files       time (sec)
3   512         4492188       17.136        1          17.1          512         76.94
4   4096        561524        2.142         1          2.1           4096        9.98
5   32768       70191         0.268         7          1.9           4682        1.56
6   262144      8774          0.033         59         2.0           4444        0.56
7   2097152     1097          0.004         477        2.0           4397        0.43
8   16777216    138           0.001         3799       2.0           4417        0.42

Table 9: Index table characteristics for a = 8, m = 50, N = 50M, K = 2MB


Table 10 shows index table characteristics and a rough estimate of the running time
of the algorithm for different values of m. The number of documents is 50M, w = 5,
a = 8. As m gets larger, the hash table size increases, which increases search time.

m     Number of   Entries per   Bucket      Buckets    Bucket file   Number     Expected search
      buckets     bucket        size (MB)   per file   size (MB)     of files   time (sec)

50    32768       70191         0.268       7          1.9           4682       1.56
60    32768       85450         0.326       6          2.0           5462       2.20
70    32768       100709        0.384       5          1.9           6554       2.95
80    32768       115967        0.442       4          1.8           8192       3.81
90    32768       131226        0.501       3          1.5           10923      4.77
100   32768       146485        0.559       3          1.7           10923      5.85

Table 10: Index table characteristics for a = 8, w = 5, N = 50M, K = 2MB

By looking at Tables 8-10 and the ROC curves in Figures 12-14 we can find “good”
values for the model parameters for the case of 50 million documents. From Table 8 and
Figure 12 we can see that although increasing a has a great effect on the search time,
its effect on the detection probability is not significant. In that sense, choosing a proper
value for a will determine search time without affecting performance. The small values
of m are better for both search time and detection probability. Choosing a best value
for w requires more thought, since increasing w reduces both the search time and the
detection probability. Picking a value for w is thus a design issue: giving preference to
search time or to probability of detection.

To summarize, the effect of a, the size of the alphabet, can be seen from Figure 12.
Increasing the value of a does not change the probability of detection significantly. But
from Table 8 we can see that the effect of a on the search time is great.

Figure 13 shows the effect of changing w, the window size. Increasing w makes the
detection probability worse. However, a very small value (3 or 4) for w increases the search
time, since the bucket size increases significantly, requiring multiple disk accesses.

Figure  14  shows  the  effect  of  changing  m,  the  signature  size.  Increasing  m  makes  the 
detection  probability  higher. 

The surfaces defined by search time, detection probability and false alarm probability
can help the designer visualize the “relative goodness” of the operating point (Figure
15). For detection we can see that lower values of w and higher values of m are desired.
However, we can see that the probability of false alarm increases for low values of w and
high values of m. The search time favors high values of w and low values of m.

Figure 14: ROC curves for β = 0.01, w = 5, a = 8 and various values of m (50, ..., 100).

A.4 System Design

Our model has two main profiles: performance and system resources. The performance
can be characterized by specifying the error tolerance. The system resources can be
thought of as defining constraints on the desired performance. It is hard to find a formula
that can give us optimum values of the model parameters (m, w, a) for given system
resources and performance range. But we can at least give a recipe for finding “good”
values for the parameters. Once the system resources are characterized, making tables (like
Tables 8 and 9) for different combinations of the parameters (m, w, a), and considering
constraint equations (2-4) for different values of these parameters, helps us understand
the possible ranges of the parameters. Then, by drawing ROC curves and looking for a
desired detection probability range, we can shrink the ranges of the parameters further.
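
The first step of this recipe, tabulating parameter combinations and discarding those that violate a resource constraint, might be sketched as follows. The constraint equations (2-4) are not reproduced here; as a stand-in, the sketch keeps only the combinations whose bucket fits within the capacity K, so that a bucket can be read with a single disk access. The function name single_access_configs, the 4-byte entry size and the 2^20-byte megabyte are assumptions made for illustration.

```python
from itertools import product


def single_access_configs(ms, ws, alphabets, n_docs=50_000_000,
                          file_mb=2, bytes_per_entry=4):
    """Return the (m, w, a) combinations whose bucket fits within file_mb."""
    limit = file_mb * 2 ** 20                      # assumption: 1 MB = 2^20 bytes
    feasible = []
    for m, w, a in product(ms, ws, alphabets):
        if w > m:
            continue                               # a window cannot exceed the signature
        total_keys = n_docs * (m - w + 1)          # k = m - w + 1 keys per document
        entries = -(-total_keys // a ** w)         # ceiling of keys per bucket
        if entries * bytes_per_entry <= limit:     # stand-in resource constraint
            feasible.append((m, w, a))
    return feasible


if __name__ == "__main__":
    # Sweep the alphabet sizes of Table 8 for m = 50, w = 5.
    print(single_access_configs([50], [5], range(4, 9)))
```

The surviving combinations would then be narrowed further using the ROC curves, as described above.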

Let  us  comment  on  each  parameter’s  effect.  The  increase  in  the  value  of  m  (the  size 
of  the  signature)  increases  the  search  time  and  the  size  of  the  index  table.  However,  the 
probability  of  duplicate  detection  becomes  higher  (Figure  14). 

Figure 15: (a) Probability of detection, (b) probability of false alarm, and (c) search time
as functions of m and w.

The increase in the value of w (the window size) makes the detection probability
lower (Figure 13). We can conclude from Figures 13 and 14 that only k = m - w + 1
(the number of keys) matters for the detection probability, because increasing w means
decreasing k, which lowers the detection probability.
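
To make the count of keys concrete, the sketch below slides a window of size w over a signature of m symbols and turns each window into a bucket index. It assumes that the shape codes are integers in the range 0 to a - 1 and that a window is read as a base-a number; this encoding is chosen only because it is consistent with the a^w buckets in the tables, and the function name window_keys is hypothetical.

```python
def window_keys(signature, w, a):
    """Return one bucket index per length-w window of the signature."""
    if any(not 0 <= symbol < a for symbol in signature):
        raise ValueError("every symbol must lie in the range 0 .. a-1")
    keys = []
    for start in range(len(signature) - w + 1):    # m - w + 1 window positions
        key = 0
        for symbol in signature[start:start + w]:
            key = key * a + symbol                 # read the window as a base-a number
        keys.append(key)
    return keys


if __name__ == "__main__":
    signature = [i % 8 for i in range(50)]         # placeholder signature, m = 50
    keys = window_keys(signature, w=5, a=8)
    print(len(keys), max(keys) < 8 ** 5)
```

For m = 50 and w = 5 this produces 46 keys, each of which addresses one of the 8^5 = 32768 buckets.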

The value of a (the alphabet size) has no significant effect on the detection probability
(Figure 12). But increasing it improves overall detection performance, since the false alarm
rate drops.

The choice of a and w is crucial for the number of entries per bucket. The number of
entries per bucket affects the search time: as the number of entries increases, the search
time increases. As w or a increases, the number of entries per bucket decreases, but
the errors increase. Smaller values of w are desirable for robustness.